Estimating Evolutionary Distances between Sequences
Abstract
These notes accompany a lecture at the summer school on mathematics for bioinformatics, Centre de Recherches Mathematiques, Montreal, August 2003. I develop basic Markov models for sequence evolution and show how they can be used to estimate evolutionary divergences between sequences. I then discuss extensions of the basic model to situations where the evolutionary rate varies over time and between sites. These notes are a sneak preview of a more detailed survey paper that is currently in preparation. Please respect this pre-publication status.

1 A discrete time model

1.1 Starting point sequences

A genetic sequence is a string of finite length over a set of states, which we number $1, 2, \dots, r$ ($r = 4$ for nucleotide sequences, $r = 20$ for amino acid sequences). The positions in the sequence are called sites. We assume that, for each $i$, the states at site $i$ in each sequence evolved from the same common ancestor: the sites are homologous.

To measure the distance between two sequences we could simply count the number of sites at which they differ. We would then miss hidden mutations. For example, A → C → G would be counted as one mutation when there were actually two, and A → C → G → A would be counted as zero mutations when there were actually three. To estimate how many hidden mutations there were, we use a Markov model.

1.2 Markov chain model

We assume that every site evolves according to the same probability distribution and that sites evolve independently of one another. Thus the probability of going from sequence $A$ to sequence $B$ (each of length $m$) over a certain time is given by

$$P[A \to B] = \prod_{i=1}^{m} P[A[i] \to B[i]].$$

Because the probability factorizes over sites, we need only focus on the probability of mutation at a single site. We assume that time proceeds in discrete 'ticks' and study the evolution of an (ancestral) sequence $A$ into the sequence $B$. We model the evolution of a site by a Markov chain.
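To make the counting problem and the product formula of Sections 1.1 and 1.2 concrete, here is a minimal Python sketch. The function names `count_differences` and `sequence_probability`, the table `site_prob`, and its values (0.7 to stay, 0.1 for each change) are hypothetical choices made only for illustration; they are not taken from the notes.

```python
from math import prod

STATES = "ACGT"  # r = 4 nucleotide states


def count_differences(a, b):
    """Number of sites at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def sequence_probability(a, b, site_prob):
    """P[A -> B] as the product over sites of P[A[i] -> B[i]].

    Relies on the assumption that sites evolve independently
    and all follow the same per-site probabilities.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return prod(site_prob[x][y] for x, y in zip(a, b))


# Hypothetical per-site probabilities over some fixed time span:
# a site keeps its state with probability 0.7 and moves to each
# of the other three states with probability 0.1.
site_prob = {
    x: {y: (0.7 if x == y else 0.1) for y in STATES}
    for x in STATES
}

ancestor = "ACGTA"
descendant = "ACTTA"

print(count_differences(ancestor, descendant))        # 1
print(sequence_probability(ancestor, descendant, site_prob))

# Hidden mutations: the path A -> C -> G -> A has three changes,
# but comparing only the endpoints shows no difference at all.
path = ["A", "C", "G", "A"]
actual_changes = sum(x != y for x, y in zip(path, path[1:]))
observed_changes = count_differences(path[0], path[-1])
print(actual_changes, observed_changes)               # 3 0
```

The last three lines reproduce the A → C → G → A example: a raw difference count sees zero changes where three occurred, which is the gap the Markov model is meant to correct.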
Let $R$ denote the transition matrix, so $R_{ij}$ equals the probability of a site being in state $j$ after one unit of time (tick) conditional on that site starting in state $i$. For each $k = 0, 1, 2, \dots$, $(R^k)_{ij}$ is the probability of a site being in state $j$ after $k$ ticks conditional on that site starting in state $i$.

1.3 Stationary distributions

We've already made many significant assumptions about the evolutionary process. Here are some more assumptions.

1. The Markov chain is irreducible: every state can get to every other.
2. The chain is aperiodic: we never get into loops we can't get out of.
3. All of the states of the chain are ergodic: as we run the chain to the limit, the probability of being in each state $j$ is non-zero and independent of the starting state. That is, there are $\pi_1, \dots, \pi_r$, all positive, such that

$$\lim_{k \to \infty} (R^k)_{ij} = \pi_j.$$

The values $\pi_1, \dots, \pi_r$ comprise a stationary distribution (also called the equilibrium distribution or equilibrium frequencies) for the states. These satisfy

$$\pi_j = \sum_{i=1}^{r} \pi_i R_{ij}. \tag{1}$$

If we sample the initial state from the stationary distribution and then run the chain for $k$ steps, the distribution of the final state will equal the stationary distribution. Equation (1), together with the requirement that the $\pi_j$ sum to 1, allows us to recover $\pi_1, \dots, \pi_r$ given only the matrix $R$.

Note that if we sample the initial state from any distribution and then run the chain to infinity, the final state will always have the stationary distribution. To see this, consider non-negative $a_1, a_2, \dots, a_r$ that sum to 1. Then $\lim_{k \to \infty} \sum^{r}$ …
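The limiting behaviour described in Section 1.3 can be checked numerically. The sketch below is only an illustration under stated assumptions: the 4-state matrix `R` is a made-up example with all entries positive (so it is irreducible and aperiodic), and the helper names `mat_mul`, `mat_pow` and `step` are hypothetical. It raises `R` to a high power, reads off an approximate stationary distribution from one row, and confirms that it satisfies equation (1).

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(r, k):
    """R^k by repeated multiplication (R^0 is the identity)."""
    n = len(r)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, r)
    return result


def step(dist, r):
    """One tick of the chain: new_j = sum_i dist_i * R_ij."""
    n = len(r)
    return [sum(dist[i] * r[i][j] for i in range(n)) for j in range(n)]


# Hypothetical transition matrix; each row sums to 1.
R = [
    [0.90, 0.04, 0.03, 0.03],
    [0.02, 0.92, 0.03, 0.03],
    [0.02, 0.04, 0.91, 0.03],
    [0.02, 0.04, 0.03, 0.91],
]

# For large k every row of R^k approaches the same vector pi.
Rk = mat_pow(R, 500)
pi = Rk[0]
print([round(p, 6) for p in pi])

# Equation (1): pi is unchanged by one more tick.
print(max(abs(x - y) for x, y in zip(step(pi, R), pi)))

# Any starting distribution a_1, ..., a_r ends up at pi as well.
a = [1.0, 0.0, 0.0, 0.0]
for _ in range(500):
    a = step(a, R)
print(max(abs(x - y) for x, y in zip(a, pi)))
```

Repeated multiplication is used here only because it mirrors the definition of $(R^k)_{ij}$ directly; solving equation (1) as a linear system with the normalization constraint would be the more usual route in practice.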